Construct Non-Hierarchical P/NBD Model for Online Retail Transaction Data

Author

Mick Cooney

Published

October 23, 2023

In this workbook we construct our first hierarchical P/NBD models on the synthetic data with the longer timeframe.

1 Load and Construct Datasets

We start by modelling the P/NBD model using our synthetic datasets before we try to model real-life data.

1.1 Load Online Retail Data

We now want to load the online retail transaction data.

Code
customer_cohortdata_tbl <- read_rds("data/onlineretail_cohort_tbl.rds")
customer_cohortdata_tbl |> glimpse()
Rows: 5,852
Columns: 5
$ customer_id     <chr> "12346", "12347", "12348", "12349", "12350", "12351", …
$ cohort_qtr      <chr> "2010 Q1", "2010 Q4", "2010 Q3", "2010 Q2", "2011 Q1",…
$ cohort_ym       <chr> "2010 03", "2010 10", "2010 09", "2010 04", "2011 02",…
$ first_tnx_date  <date> 2010-03-02, 2010-10-31, 2010-09-27, 2010-04-29, 2011-…
$ total_tnx_count <int> 3, 8, 5, 3, 1, 1, 9, 2, 1, 2, 6, 2, 5, 10, 6, 4, 10, 2…
Code
customer_transactions_tbl <- read_rds("data/onlineretail_transactions_tbl.rds")
customer_transactions_tbl |> glimpse()
Rows: 36,658
Columns: 4
$ tnx_timestamp <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:59, 2009-12-01 09…
$ customer_id   <chr> "13085", "13085", "13078", "15362", "18102", "12682", "1…
$ invoice_id    <chr> "489434", "489435", "489436", "489437", "489438", "48943…
$ tnx_amount    <dbl> 505.30, 145.80, 630.33, 310.75, 2286.24, 372.30, 50.40, …
Code
customer_subset_id <- read_rds("data/onlineretail_customer_subset_ids.rds")
customer_subset_id |> glimpse()
 chr [1:2000] "12347" "12348" "12349" "12355" "12357" "12359" "12360" ...

1.2 Load Derived Data

Code
customer_summarystats_tbl <- read_rds("data/onlineretail_customer_summarystats_tbl.rds")

obs_fitdata_tbl   <- read_rds("data/onlineretail_obs_fitdata_tbl.rds")
obs_validdata_tbl <- read_rds("data/onlineretail_obs_validdata_tbl.rds")

customer_fit_stats_tbl <- obs_fitdata_tbl |>
  rename(x = tnx_count)

1.3 Load Subset Data

We also want to construct our data subsets for the purposes of speeding up our valuations.

Code
customer_fit_subset_tbl <- obs_fitdata_tbl |>
  filter(customer_id %in% customer_subset_id)

customer_fit_subset_tbl |> glimpse()
Rows: 2,000
Columns: 6
$ customer_id    <chr> "12347", "12348", "12349", "12355", "12357", "12359", "…
$ first_tnx_date <dttm> 2010-10-31 14:19:59, 2010-09-27 14:58:59, 2010-04-29 1…
$ last_tnx_date  <dttm> 2010-10-31 14:19:59, 2010-09-27 14:58:59, 2010-10-28 0…
$ tnx_count      <dbl> 0, 0, 1, 0, 0, 5, 2, 2, 0, 0, 2, 2, 1, 1, 3, 0, 2, 4, 0…
$ t_x            <dbl> 0.000000, 0.000000, 25.970536, 0.000000, 0.000000, 44.1…
$ T_cal          <dbl> 4.3432540, 9.1965278, 30.7777778, 27.6429563, 2.0828373…
Code
customer_valid_subset_tbl <- obs_validdata_tbl |>
  filter(customer_id %in% customer_subset_id)

customer_valid_subset_tbl |> glimpse()
Rows: 2,000
Columns: 3
$ customer_id       <chr> "12347", "12348", "12349", "12355", "12357", "12359"…
$ tnx_count         <dbl> 7, 4, 1, 1, 1, 4, 3, 1, 0, 0, 0, 1, 1, 2, 4, 0, 0, 1…
$ tnx_last_interval <dbl> 53.09444, 42.65010, 50.77292, 22.79653, 48.66736, 45…

We now use these datasets to set the start and end dates for our various validation methods.

Code
dates_lst <- read_rds("data/onlineretail_simulation_dates.rds")

use_fit_start_date <- dates_lst$use_fit_start_date
use_fit_end_date   <- dates_lst$use_fit_end_date

use_valid_start_date <- dates_lst$use_valid_start_date
use_valid_end_date   <- dates_lst$use_valid_end_date

Finally, we need to set our directories where we save our Stan code and the model outputs.

Code
stan_modeldir <- "stan_models"
stan_codedir  <-   "stan_code"

2 Fit First P/NBD Model

We now construct our Stan model and prepare to fit it with our synthetic dataset.

We also want to set a number of overall parameters for this workbook

To start the fit data, we want to use the 1,000 customers. We also need to calculate the summary statistics for the validation period.

2.1 Compile and Fit Stan Model

We now compile this model using CmdStanR.

Code
pnbd_fixed_stanmodel <- cmdstan_model(
  "stan_code/pnbd_fixed.stan",
  include_paths =   stan_codedir,
  pedantic      =           TRUE,
  dir           =  stan_modeldir
  )

We then use this compiled model with our data to produce a fit of the data.

Code
stan_modelname <- "pnbd_onlineretail_fixed1"
stanfit_seed   <- stanfit_seed + 1
stanfit_prefix <- str_c("fit_", stan_modelname) 

stanfit_object_file <- glue("data/{stanfit_prefix}_stanfit.rds")

stan_data_lst <- customer_fit_stats_tbl |>
  select(customer_id, x, t_x, T_cal) |>
  compose_data(
    lambda_mn = 0.25,
    lambda_cv = 1.00,
    
    mu_mn     = 0.10,
    mu_cv     = 1.00,
    )

if(!file_exists(stanfit_object_file)) {
  pnbd_onlineretail_fixed1_stanfit <- pnbd_fixed_stanmodel$sample(
    data            =                stan_data_lst,
    chains          =                            4,
    iter_warmup     =                          500,
    iter_sampling   =                          500,
    seed            =                 stanfit_seed,
    save_warmup     =                         TRUE,
    output_dir      =                stan_modeldir,
    output_basename =               stanfit_prefix,
    )
  
  pnbd_onlineretail_fixed1_stanfit$save_object(stanfit_object_file, compress = "gzip")

} else {
  pnbd_onlineretail_fixed1_stanfit <- read_rds(stanfit_object_file)
}
Running MCMC with 4 chains, at most 12 in parallel...

Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 2 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 3 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 4 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 3 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 4 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 2 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 3 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 4 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 2 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 3 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 4 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 2 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 3 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 4 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 2 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 3 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 3 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 4 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 4 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 2 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 2 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 3 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 4 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 2 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 3 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 4 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 2 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 3 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 4 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 2 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 3 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 4 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 2 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 1 finished in 170.7 seconds.
Chain 3 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 3 finished in 172.3 seconds.
Chain 4 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 4 finished in 175.4 seconds.
Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 2 finished in 176.3 seconds.

All 4 chains finished successfully.
Mean chain execution time: 173.6 seconds.
Total execution time: 176.5 seconds.
Code
pnbd_onlineretail_fixed1_stanfit$summary()
# A tibble: 12,718 × 10
   variable       mean   median      sd     mad       q5      q95  rhat ess_bulk
   <chr>         <num>    <num>   <num>   <num>    <num>    <num> <num>    <num>
 1 lp__       -6.82e+4 -6.82e+4 71.9    71.7    -6.83e+4 -6.81e+4  1.01     704.
 2 lambda[1]   6.43e-2  5.20e-2  0.0497  0.0409  1.04e-2  1.59e-1  1.00    1962.
 3 lambda[2]   1.41e-1  9.80e-2  0.148   0.0972  6.83e-3  4.24e-1  1.00    1817.
 4 lambda[3]   1.33e-1  8.00e-2  0.160   0.0884  4.79e-3  4.41e-1  1.00    1351.
 5 lambda[4]   5.78e-2  4.77e-2  0.0408  0.0350  1.11e-2  1.37e-1  1.00    1626.
 6 lambda[5]   2.47e-1  1.61e-1  0.271   0.174   1.03e-2  7.39e-1  1.00    1200.
 7 lambda[6]   3.09e-1  2.59e-1  0.216   0.185   5.64e-2  7.29e-1  1.00    2131.
 8 lambda[7]   1.42e-1  9.21e-2  0.155   0.0979  7.57e-3  4.44e-1  1.00    1678.
 9 lambda[8]   1.35e-1  7.92e-2  0.161   0.0886  4.57e-3  4.50e-1  1.00    1374.
10 lambda[9]   2.70e-1  2.45e-1  0.154   0.140   7.44e-2  5.52e-1  1.00    1920.
# ℹ 12,708 more rows
# ℹ 1 more variable: ess_tail <num>

We have some basic HMC-based validity statistics we can check.

Code
pnbd_onlineretail_fixed1_stanfit$cmdstan_diagnose()
Processing csv files: /home/rstudio/btydwork/stan_models/fit_pnbd_onlineretail_fixed1-1.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_onlineretail_fixed1-2.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_onlineretail_fixed1-3.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_onlineretail_fixed1-4.csvWarning: non-fatal error reading adaptation data


Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Effective sample size satisfactory.

Split R-hat values satisfactory all parameters.

Processing complete, no problems detected.

2.2 Visual Diagnostics of the Sample Validity

Now that we have a sample from the posterior distribution we need to create a few different visualisations of the diagnostics.

Code
parameter_subset <- c(
  "lambda[1]", "lambda[2]", "lambda[3]", "lambda[4]",
  "mu[1]",     "mu[2]",     "mu[3]",     "mu[4]"
  )

pnbd_onlineretail_fixed1_stanfit$draws(inc_warmup = FALSE) |>
  mcmc_trace(pars = parameter_subset) +
  expand_limits(y = 0) +
  labs(
    x = "Iteration",
    y = "Value",
    title = "Traceplot of Sample of Lambda and Mu Values"
    ) +
  theme(axis.text.x = element_text(size = 10))

We also check \(N_{eff}\) as a quick diagnostic of the fit.

Code
pnbd_onlineretail_fixed1_stanfit |>
  neff_ratio(pars = c("lambda", "mu")) |>
  as.numeric() |>
  mcmc_neff() +
    ggtitle("Plot of Parameter Effective Sample Sizes")

2.3 Assess the Model

As we intend to run the same logic to assess each of our models, we have combined all this logic into a single function run_model_assessment, to run the simulations and combine the datasets.

Code
pnbd_stanfit <- pnbd_onlineretail_fixed1_stanfit |>
  recover_types(customer_fit_stats_tbl)

pnbd_onlineretail_fixed1_assess_data_lst <- run_model_assessment(
  model_stanfit    = pnbd_stanfit,
  insample_tbl     = customer_fit_subset_tbl,
  fit_label        = "pnbd_onlineretail_fixed1",
  fit_end_dttm     = use_fit_end_date     |> as.POSIXct(),
  valid_start_dttm = use_valid_start_date |> as.POSIXct(),
  valid_end_dttm   = use_valid_end_date   |> as.POSIXct(),
  sim_seed         = 10
  )

pnbd_onlineretail_fixed1_assess_data_lst |> glimpse()
List of 5
 $ model_fit_index_filepath     : 'glue' chr "data/pnbd_onlineretail_fixed1_assess_fit_index_tbl.rds"
 $ model_valid_index_filepath   : 'glue' chr "data/pnbd_onlineretail_fixed1_assess_valid_index_tbl.rds"
 $ model_simstats_filepath      : 'glue' chr "data/pnbd_onlineretail_fixed1_assess_model_simstats_tbl.rds"
 $ model_fit_simstats_filepath  : 'glue' chr "data/pnbd_onlineretail_fixed1_assess_fit_simstats_tbl.rds"
 $ model_valid_simstats_filepath: 'glue' chr "data/pnbd_onlineretail_fixed1_assess_valid_simstats_tbl.rds"

2.3.1 Check In-Sample Data Validation

We first check the model against the in-sample data.

Code
simdata_tbl <- pnbd_onlineretail_fixed1_assess_data_lst |>
  use_series(model_fit_simstats_filepath) |>
  read_rds()

insample_plots_lst <- create_model_assessment_plots(
  obsdata_tbl = obs_fitdata_tbl,
  simdata_tbl = simdata_tbl
  )

insample_plots_lst$multi_plot |> print()

Code
insample_plots_lst$total_plot |> print()

Code
insample_plots_lst$quant_plot |> print()

This fit looks reasonable and appears to capture most of the aspects of the data used to fit it. Given that this is a synthetic dataset, this is not surprising, but at least we appreciate that our model is valid.

2.3.2 Check Out-of-Sample Data Validation

We now repeat for the out-of-sample data.

Code
simdata_tbl <- pnbd_onlineretail_fixed1_assess_data_lst |>
  use_series(model_valid_simstats_filepath) |>
  read_rds()

outsample_plots_lst <- create_model_assessment_plots(
  obsdata_tbl = obs_validdata_tbl,
  simdata_tbl = simdata_tbl
  )

outsample_plots_lst$multi_plot |> print()

Code
outsample_plots_lst$total_plot |> print()

Code
outsample_plots_lst$quant_plot |> print()

As for our short time frame data, overall our model is working well.

           used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  3913101 209.0    8191587  437.5   8191587  437.5
Vcells 70189754 535.6  334342901 2550.9 417928169 3188.6

3 Fit Alternate Prior Model.

We want to try an alternate prior model with a smaller co-efficient of variation to see what impact it has on our procedures.

Code
stan_modelname <- "pnbd_onlineretail_fixed2"
stanfit_seed   <- stanfit_seed + 1
stanfit_prefix <- str_c("fit_", stan_modelname) 

stanfit_object_file <- glue("data/{stanfit_prefix}_stanfit.rds")

stan_data_lst <- customer_fit_stats_tbl |>
  select(customer_id, x, t_x, T_cal) |>
  compose_data(
    lambda_mn = 0.25,
    lambda_cv = 0.50,
    
    mu_mn     = 0.10,
    mu_cv     = 0.50,
    )

if(!file_exists(stanfit_object_file)) {
  pnbd_onlineretail_fixed2_stanfit <- pnbd_fixed_stanmodel$sample(
    data            =                stan_data_lst,
    chains          =                            4,
    iter_warmup     =                          500,
    iter_sampling   =                          500,
    seed            =                 stanfit_seed,
    save_warmup     =                         TRUE,
    output_dir      =                stan_modeldir,
    output_basename =               stanfit_prefix,
    )
  
  pnbd_onlineretail_fixed2_stanfit$save_object(stanfit_object_file, compress = "gzip")

} else {
  pnbd_onlineretail_fixed2_stanfit <- read_rds(stanfit_object_file)
}
Running MCMC with 4 chains, at most 12 in parallel...

Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 2 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 3 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 4 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 4 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 3 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 2 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 4 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 3 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 2 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 3 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 4 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 2 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 3 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 4 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 2 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 4 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 4 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 3 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 3 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 2 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 2 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 4 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 3 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 2 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 4 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 3 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 2 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 4 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 3 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 2 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 4 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 3 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 2 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 4 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 4 finished in 141.8 seconds.
Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 1 finished in 142.5 seconds.
Chain 3 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 3 finished in 143.5 seconds.
Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 2 finished in 143.9 seconds.

All 4 chains finished successfully.
Mean chain execution time: 142.9 seconds.
Total execution time: 144.2 seconds.
Code
pnbd_onlineretail_fixed2_stanfit$summary()
# A tibble: 12,718 × 10
   variable       mean   median      sd     mad       q5      q95  rhat ess_bulk
   <chr>         <num>    <num>   <num>   <num>    <num>    <num> <num>    <num>
 1 lp__       -1.44e+5 -1.44e+5 67.2    66.7    -1.44e+5 -1.44e+5  1.00     719.
 2 lambda[1]   1.32e-1  1.21e-1  0.0625  0.0582  4.95e-2  2.51e-1  1.00    3089.
 3 lambda[2]   2.11e-1  1.93e-1  0.108   0.0999  7.49e-2  4.12e-1  1.00    2902.
 4 lambda[3]   2.07e-1  1.86e-1  0.108   0.0955  7.00e-2  4.12e-1  1.00    3208.
 5 lambda[4]   1.09e-1  1.02e-1  0.0475  0.0431  4.30e-2  1.99e-1  1.00    3086.
 6 lambda[5]   2.49e-1  2.26e-1  0.129   0.119   7.89e-2  4.93e-1  1.00    4277.
 7 lambda[6]   2.69e-1  2.50e-1  0.119   0.113   1.09e-1  4.97e-1  1.00    3642.
 8 lambda[7]   2.09e-1  1.90e-1  0.109   0.103   6.30e-2  4.09e-1  1.00    3500.
 9 lambda[8]   2.12e-1  1.96e-1  0.109   0.101   6.60e-2  4.20e-1  1.01    3648.
10 lambda[9]   2.60e-1  2.44e-1  0.107   0.103   1.12e-1  4.61e-1  1.01    3962.
# ℹ 12,708 more rows
# ℹ 1 more variable: ess_tail <num>

We have some basic HMC-based validity statistics we can check.

Code
pnbd_onlineretail_fixed2_stanfit$cmdstan_diagnose()
Processing csv files: /home/rstudio/btydwork/stan_models/fit_pnbd_onlineretail_fixed2-1.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_onlineretail_fixed2-2.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_onlineretail_fixed2-3.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_onlineretail_fixed2-4.csvWarning: non-fatal error reading adaptation data


Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Effective sample size satisfactory.

Split R-hat values satisfactory all parameters.

Processing complete, no problems detected.

3.1 Visual Diagnostics of the Sample Validity

Now that we have a sample from the posterior distribution we need to create a few different visualisations of the diagnostics.

Code
parameter_subset <- c(
  "lambda[1]", "lambda[2]", "lambda[3]", "lambda[4]",
  "mu[1]",     "mu[2]",     "mu[3]",     "mu[4]"
  )

pnbd_onlineretail_fixed2_stanfit$draws(inc_warmup = FALSE) |>
  mcmc_trace(pars = parameter_subset) +
  expand_limits(y = 0) +
  labs(
    x = "Iteration",
    y = "Value",
    title = "Traceplot of Sample of Lambda and Mu Values"
    ) +
  theme(axis.text.x = element_text(size = 10))

We want to check the \(N_{eff}\) statistics also.

Code
pnbd_onlineretail_fixed2_stanfit |>
  neff_ratio(pars = c("lambda", "mu")) |>
  as.numeric() |>
  mcmc_neff() +
    ggtitle("Plot of Parameter Effective Sample Sizes")

3.2 Assess the Model

As we intend to run the same logic to assess each of our models, we have combined all this logic into a single function run_model_assessment, to run the simulations and combine the datasets.

Code
pnbd_stanfit <- pnbd_onlineretail_fixed2_stanfit |>
  recover_types(customer_fit_stats_tbl)

pnbd_onlineretail_fixed2_assess_data_lst <- run_model_assessment(
  model_stanfit    = pnbd_stanfit,
  insample_tbl     = customer_fit_subset_tbl,
  fit_label        = "pnbd_onlineretail_fixed2",
  fit_end_dttm     = use_fit_end_date     |> as.POSIXct(),
  valid_start_dttm = use_valid_start_date |> as.POSIXct(),
  valid_end_dttm   = use_valid_end_date   |> as.POSIXct(),
  sim_seed         = 20
  )

pnbd_onlineretail_fixed2_assess_data_lst |> glimpse()
List of 5
 $ model_fit_index_filepath     : 'glue' chr "data/pnbd_onlineretail_fixed2_assess_fit_index_tbl.rds"
 $ model_valid_index_filepath   : 'glue' chr "data/pnbd_onlineretail_fixed2_assess_valid_index_tbl.rds"
 $ model_simstats_filepath      : 'glue' chr "data/pnbd_onlineretail_fixed2_assess_model_simstats_tbl.rds"
 $ model_fit_simstats_filepath  : 'glue' chr "data/pnbd_onlineretail_fixed2_assess_fit_simstats_tbl.rds"
 $ model_valid_simstats_filepath: 'glue' chr "data/pnbd_onlineretail_fixed2_assess_valid_simstats_tbl.rds"

3.2.1 Check In-Sample Data Validation

We first check the model against the in-sample data.

Code
simdata_tbl <- pnbd_onlineretail_fixed2_assess_data_lst |>
  use_series(model_fit_simstats_filepath) |>
  read_rds()

insample_plots_lst <- create_model_assessment_plots(
  obsdata_tbl = obs_fitdata_tbl,
  simdata_tbl = simdata_tbl
  )

insample_plots_lst$multi_plot |> print()

Code
insample_plots_lst$total_plot |> print()

Code
insample_plots_lst$quant_plot |> print()

This fit looks reasonable and appears to capture most of the aspects of the data used to fit it. Given that this is a synthetic dataset, this is not surprising, but at least we appreciate that our model is valid.

3.2.2 Check Out-of-Sample Data Validation

We now repeat for the out-of-sample data.

Code
simdata_tbl <- pnbd_onlineretail_fixed2_assess_data_lst |>
  use_series(model_valid_simstats_filepath) |>
  read_rds()

outsample_plots_lst <- create_model_assessment_plots(
  obsdata_tbl = obs_validdata_tbl,
  simdata_tbl = simdata_tbl
  )

outsample_plots_lst$multi_plot |> print()

Code
outsample_plots_lst$total_plot |> print()

Code
outsample_plots_lst$quant_plot |> print()

            used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   3916063 209.2    8191589  437.5   8191589  437.5
Vcells 121157261 924.4  404769157 3088.2 505960818 3860.2

4 Fit Tight-Lifetime Model

We now want to try a model where we use priors with a tighter coefficient of variation for lifetime but keep the CoV for transaction frequency.

Code
stan_modelname <- "pnbd_onlineretail_fixed3"
stanfit_seed   <- stanfit_seed + 1
stanfit_prefix <- str_c("fit_", stan_modelname) 

stanfit_object_file <- glue("data/{stanfit_prefix}_stanfit.rds")


stan_data_lst <- customer_fit_stats_tbl |>
  select(customer_id, x, t_x, T_cal) |>
  compose_data(
    lambda_mn = 0.25,
    lambda_cv = 1.00,
    
    mu_mn     = 0.10,
    mu_cv     = 0.50,
    )

if(!file_exists(stanfit_object_file)) {
  pnbd_onlineretail_fixed3_stanfit <- pnbd_fixed_stanmodel$sample(
    data            =                stan_data_lst,
    chains          =                            4,
    iter_warmup     =                          500,
    iter_sampling   =                          500,
    seed            =                 stanfit_seed,
    save_warmup     =                         TRUE,
    output_dir      =                stan_modeldir,
    output_basename =               stanfit_prefix,
    )
  
  pnbd_onlineretail_fixed3_stanfit$save_object(stanfit_object_file, compress = "gzip")

} else {
  pnbd_onlineretail_fixed3_stanfit <- read_rds(stanfit_object_file)
}
Running MCMC with 4 chains, at most 12 in parallel...

Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 2 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 3 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 4 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 4 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 2 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 3 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 4 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 2 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 3 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 4 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 2 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 3 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 4 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 2 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 3 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 4 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 4 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 2 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 2 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 3 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 3 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 4 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 2 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 3 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 4 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 2 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 3 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 4 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 2 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 3 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 4 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 2 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 3 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 4 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 4 finished in 174.0 seconds.
Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 2 finished in 178.6 seconds.
Chain 3 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 3 finished in 183.6 seconds.
Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 1 finished in 185.1 seconds.

All 4 chains finished successfully.
Mean chain execution time: 180.3 seconds.
Total execution time: 185.4 seconds.
Code
pnbd_onlineretail_fixed3_stanfit$summary()
# A tibble: 12,718 × 10
   variable       mean   median      sd     mad       q5      q95  rhat ess_bulk
   <chr>         <num>    <num>   <num>   <num>    <num>    <num> <num>    <num>
 1 lp__       -1.12e+5 -1.12e+5 69.6    69.7    -1.12e+5 -1.12e+5  1.00     780.
 2 lambda[1]   7.15e-2  5.77e-2  0.0542  0.0429  1.15e-2  1.79e-1  1.00    2655.
 3 lambda[2]   1.49e-1  9.94e-2  0.162   0.105   7.06e-3  4.46e-1  1.00    1844.
 4 lambda[3]   1.34e-1  8.13e-2  0.158   0.0858  6.35e-3  4.43e-1  1.00    2267.
 5 lambda[4]   5.92e-2  5.00e-2  0.0415  0.0377  9.93e-3  1.39e-1  1.00    2612.
 6 lambda[5]   2.36e-1  1.69e-1  0.228   0.166   1.38e-2  6.98e-1  1.00    2197.
 7 lambda[6]   3.00e-1  2.55e-1  0.208   0.187   5.53e-2  7.02e-1  1.00    2496.
 8 lambda[7]   1.44e-1  9.25e-2  0.160   0.0985  6.70e-3  4.23e-1  1.00    2262.
 9 lambda[8]   1.34e-1  8.35e-2  0.154   0.0923  5.13e-3  4.34e-1  1.00    2776.
10 lambda[9]   2.69e-1  2.38e-1  0.158   0.144   6.92e-2  5.62e-1  1.00    2091.
# ℹ 12,708 more rows
# ℹ 1 more variable: ess_tail <num>

We have some basic HMC-based validity statistics we can check.

Code
pnbd_onlineretail_fixed3_stanfit$cmdstan_diagnose()
Processing csv files: /home/rstudio/btydwork/stan_models/fit_pnbd_onlineretail_fixed3-1.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_onlineretail_fixed3-2.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_onlineretail_fixed3-3.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_onlineretail_fixed3-4.csvWarning: non-fatal error reading adaptation data


Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Effective sample size satisfactory.

Split R-hat values satisfactory all parameters.

Processing complete, no problems detected.

4.1 Visual Diagnostics of the Sample Validity

Now that we have a sample from the posterior distribution we need to create a few different visualisations of the diagnostics.

Code
parameter_subset <- c(
  "lambda[1]", "lambda[2]", "lambda[3]", "lambda[4]",
  "mu[1]",     "mu[2]",     "mu[3]",     "mu[4]"
  )

pnbd_onlineretail_fixed3_stanfit$draws(inc_warmup = FALSE) |>
  mcmc_trace(pars = parameter_subset) +
  expand_limits(y = 0) +
  labs(
    x = "Iteration",
    y = "Value",
    title = "Traceplot of Sample of Lambda and Mu Values"
    ) +
  theme(axis.text.x = element_text(size = 10))

We want to check the \(N_{eff}\) statistics also.

Code
pnbd_onlineretail_fixed3_stanfit |>
  neff_ratio(pars = c("lambda", "mu")) |>
  as.numeric() |>
  mcmc_neff() +
    ggtitle("Plot of Parameter Effective Sample Sizes")

4.2 Assess the Model

As we intend to run the same logic to assess each of our models, we have combined all this logic into a single function run_model_assessment, to run the simulations and combine the datasets.

Code
pnbd_stanfit <- pnbd_onlineretail_fixed3_stanfit |>
  recover_types(customer_fit_stats_tbl)

pnbd_onlineretail_fixed3_assess_data_lst <- run_model_assessment(
  model_stanfit    = pnbd_stanfit,
  insample_tbl     = customer_fit_subset_tbl,
  fit_label        = "pnbd_onlineretail_fixed3",
  fit_end_dttm     = use_fit_end_date     |> as.POSIXct(),
  valid_start_dttm = use_valid_start_date |> as.POSIXct(),
  valid_end_dttm   = use_valid_end_date   |> as.POSIXct(),
  sim_seed         = 30
  )

pnbd_onlineretail_fixed3_assess_data_lst |> glimpse()
List of 5
 $ model_fit_index_filepath     : 'glue' chr "data/pnbd_onlineretail_fixed3_assess_fit_index_tbl.rds"
 $ model_valid_index_filepath   : 'glue' chr "data/pnbd_onlineretail_fixed3_assess_valid_index_tbl.rds"
 $ model_simstats_filepath      : 'glue' chr "data/pnbd_onlineretail_fixed3_assess_model_simstats_tbl.rds"
 $ model_fit_simstats_filepath  : 'glue' chr "data/pnbd_onlineretail_fixed3_assess_fit_simstats_tbl.rds"
 $ model_valid_simstats_filepath: 'glue' chr "data/pnbd_onlineretail_fixed3_assess_valid_simstats_tbl.rds"

4.2.1 Check In-Sample Data Validation

We first check the model against the in-sample data.

Code
simdata_tbl <- pnbd_onlineretail_fixed3_assess_data_lst |>
  use_series(model_fit_simstats_filepath) |>
  read_rds()

insample_plots_lst <- create_model_assessment_plots(
  obsdata_tbl = obs_fitdata_tbl,
  simdata_tbl = simdata_tbl
  )

insample_plots_lst$multi_plot |> print()

Code
insample_plots_lst$total_plot |> print()

Code
insample_plots_lst$quant_plot |> print()

This fit looks reasonable and appears to capture most of the aspects of the data used to fit it. Given that this is a synthetic dataset, this is not surprising, but at least we appreciate that our model is valid.

4.2.2 Check Out-of-Sample Data Validation

We now repeat for the out-of-sample data.

Code
simdata_tbl <- pnbd_onlineretail_fixed3_assess_data_lst |>
  use_series(model_valid_simstats_filepath) |>
  read_rds()

outsample_plots_lst <- create_model_assessment_plots(
  obsdata_tbl = obs_validdata_tbl,
  simdata_tbl = simdata_tbl
  )

outsample_plots_lst$multi_plot |> print()

Code
outsample_plots_lst$total_plot |> print()

Code
outsample_plots_lst$quant_plot |> print()

            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   3918301  209.3    8191591  437.5   8191591  437.5
Vcells 172121626 1313.2  466434868 3558.7 583043363 4448.3

5 Fit Narrow-Short-Lifetime Model

We now want to try a model where we use priors with a tighter coefficient of variation for lifetime but keep the CoV for transaction frequency.

Code
stan_modelname <- "pnbd_onlineretail_fixed4"
stanfit_seed   <- stanfit_seed + 1
stanfit_prefix <- str_c("fit_", stan_modelname) 

stanfit_object_file <- glue("data/{stanfit_prefix}_stanfit.rds")


stan_data_lst <- customer_fit_stats_tbl |>
  select(customer_id, x, t_x, T_cal) |>
  compose_data(
    lambda_mn = 0.25,
    lambda_cv = 1.00,
    
    mu_mn     = 0.20,
    mu_cv     = 0.30,
    )

if(!file_exists(stanfit_object_file)) {
  pnbd_onlineretail_fixed4_stanfit <- pnbd_fixed_stanmodel$sample(
    data            =                stan_data_lst,
    chains          =                            4,
    iter_warmup     =                          500,
    iter_sampling   =                          500,
    seed            =                 stanfit_seed,
    save_warmup     =                         TRUE,
    output_dir      =                stan_modeldir,
    output_basename =               stanfit_prefix,
    )
  
  pnbd_onlineretail_fixed4_stanfit$save_object(stanfit_object_file, compress = "gzip")

} else {
  pnbd_onlineretail_fixed4_stanfit <- read_rds(stanfit_object_file)
}
Running MCMC with 4 chains, at most 12 in parallel...

Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 2 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 3 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 4 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 4 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 3 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 4 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 2 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 3 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 4 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 3 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 2 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 4 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 3 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 2 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 4 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 4 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 3 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 3 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 2 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 4 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 3 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 2 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 4 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 2 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 3 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 2 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 4 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 3 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 2 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 4 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 3 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 1 finished in 203.3 seconds.
Chain 2 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 4 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 4 finished in 205.2 seconds.
Chain 3 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 3 finished in 212.2 seconds.
Chain 2 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 2 finished in 230.4 seconds.

All 4 chains finished successfully.
Mean chain execution time: 212.8 seconds.
Total execution time: 230.7 seconds.
Code
pnbd_onlineretail_fixed4_stanfit$summary()
# A tibble: 12,718 × 10
   variable       mean   median      sd     mad       q5      q95  rhat ess_bulk
   <chr>         <num>    <num>   <num>   <num>    <num>    <num> <num>    <num>
 1 lp__       -1.85e+5 -1.85e+5 73.2    72.6    -1.85e+5 -1.85e+5 1.01      508.
 2 lambda[1]   8.12e-2  6.75e-2  0.0574  0.0499  1.64e-2  1.92e-1 1.00     2576.
 3 lambda[2]   1.62e-1  1.09e-1  0.182   0.115   4.14e-3  5.07e-1 1.00     1526.
 4 lambda[3]   1.61e-1  1.02e-1  0.183   0.109   6.79e-3  5.06e-1 1.00     2034.
 5 lambda[4]   6.06e-2  5.05e-2  0.0427  0.0379  1.06e-2  1.45e-1 1.00     2772.
 6 lambda[5]   2.37e-1  1.65e-1  0.231   0.170   1.36e-2  7.13e-1 1.00     2523.
 7 lambda[6]   3.02e-1  2.53e-1  0.205   0.182   5.74e-2  6.85e-1 0.999    2449.
 8 lambda[7]   1.61e-1  1.09e-1  0.172   0.114   7.40e-3  5.07e-1 1.00     1919.
 9 lambda[8]   1.65e-1  1.05e-1  0.182   0.114   7.23e-3  5.39e-1 1.00     2233.
10 lambda[9]   2.70e-1  2.42e-1  0.152   0.135   7.81e-2  5.74e-1 1.00     2380.
# ℹ 12,708 more rows
# ℹ 1 more variable: ess_tail <num>

We have some basic HMC-based validity statistics we can check.

Code
pnbd_onlineretail_fixed4_stanfit$cmdstan_diagnose()
Processing csv files: /home/rstudio/btydwork/stan_models/fit_pnbd_onlineretail_fixed4-1.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_onlineretail_fixed4-2.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_onlineretail_fixed4-3.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_onlineretail_fixed4-4.csvWarning: non-fatal error reading adaptation data


Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Effective sample size satisfactory.

Split R-hat values satisfactory all parameters.

Processing complete, no problems detected.

5.1 Visual Diagnostics of the Sample Validity

Now that we have a sample from the posterior distribution we need to create a few different visualisations of the diagnostics.

Code
parameter_subset <- c(
  "lambda[1]", "lambda[2]", "lambda[3]", "lambda[4]",
  "mu[1]",     "mu[2]",     "mu[3]",     "mu[4]"
  )

pnbd_onlineretail_fixed4_stanfit$draws(inc_warmup = FALSE) |>
  mcmc_trace(pars = parameter_subset) +
  expand_limits(y = 0) +
  labs(
    x = "Iteration",
    y = "Value",
    title = "Traceplot of Sample of Lambda and Mu Values"
    ) +
  theme(axis.text.x = element_text(size = 10))

We want to check the \(N_{eff}\) statistics also.

Code
pnbd_onlineretail_fixed4_stanfit |>
  neff_ratio(pars = c("lambda", "mu")) |>
  as.numeric() |>
  mcmc_neff() +
    ggtitle("Plot of Parameter Effective Sample Sizes")

5.2 Assess the Model

As we intend to run the same logic to assess each of our models, we have combined all this logic into a single function run_model_assessment, to run the simulations and combine the datasets.

Code
pnbd_stanfit <- pnbd_onlineretail_fixed4_stanfit |>
  recover_types(customer_fit_stats_tbl)

pnbd_onlineretail_fixed4_assess_data_lst <- run_model_assessment(
  model_stanfit    = pnbd_stanfit,
  insample_tbl     = customer_fit_subset_tbl,
  fit_label        = "pnbd_onlineretail_fixed4",
  fit_end_dttm     = use_fit_end_date     |> as.POSIXct(),
  valid_start_dttm = use_valid_start_date |> as.POSIXct(),
  valid_end_dttm   = use_valid_end_date   |> as.POSIXct(),
  sim_seed         = 40
  )

pnbd_onlineretail_fixed4_assess_data_lst |> glimpse()
List of 5
 $ model_fit_index_filepath     : 'glue' chr "data/pnbd_onlineretail_fixed4_assess_fit_index_tbl.rds"
 $ model_valid_index_filepath   : 'glue' chr "data/pnbd_onlineretail_fixed4_assess_valid_index_tbl.rds"
 $ model_simstats_filepath      : 'glue' chr "data/pnbd_onlineretail_fixed4_assess_model_simstats_tbl.rds"
 $ model_fit_simstats_filepath  : 'glue' chr "data/pnbd_onlineretail_fixed4_assess_fit_simstats_tbl.rds"
 $ model_valid_simstats_filepath: 'glue' chr "data/pnbd_onlineretail_fixed4_assess_valid_simstats_tbl.rds"

5.2.1 Check In-Sample Data Validation

We first check the model against the in-sample data.

Code
simdata_tbl <- pnbd_onlineretail_fixed4_assess_data_lst |>
  use_series(model_fit_simstats_filepath) |>
  read_rds()

insample_plots_lst <- create_model_assessment_plots(
  obsdata_tbl = obs_fitdata_tbl,
  simdata_tbl = simdata_tbl
  )

insample_plots_lst$multi_plot |> print()

Code
insample_plots_lst$total_plot |> print()

Code
insample_plots_lst$quant_plot |> print()

This fit looks reasonable and appears to capture most of the aspects of the data used to fit it. Given that this is a synthetic dataset, this is not surprising, but at least we appreciate that our model is valid.

5.2.2 Check Out-of-Sample Data Validation

We now repeat for the out-of-sample data.

Code
simdata_tbl <- pnbd_onlineretail_fixed4_assess_data_lst |>
  use_series(model_valid_simstats_filepath) |>
  read_rds()

outsample_plots_lst <- create_model_assessment_plots(
  obsdata_tbl = obs_validdata_tbl,
  simdata_tbl = simdata_tbl
  )

outsample_plots_lst$multi_plot |> print()

Code
outsample_plots_lst$total_plot |> print()

Code
outsample_plots_lst$quant_plot |> print()

            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   3920552  209.4    8191591  437.5   8191591  437.5
Vcells 223086023 1702.1  671842209 5125.8 668403113 5099.6

6 Compare Model Outputs

We have looked at each of the models individually, but it is also worth looking at each of the models as a group.

Code
calculate_simulation_statistics <- function(file_rds) {
  simdata_tbl <- read_rds(file_rds)
  
  multicount_cust_tbl <- simdata_tbl |>
    filter(sim_tnx_count > 0) |>
    count(draw_id, name = "multicust_count")
  
  totaltnx_data_tbl <- simdata_tbl |>
    count(draw_id, wt = sim_tnx_count, name = "simtnx_count")
  
  simstats_tbl <- multicount_cust_tbl |>
    inner_join(totaltnx_data_tbl, by = "draw_id")
  
  return(simstats_tbl)
}
Code
obs_fit_customer_count <- obs_fitdata_tbl |>
  filter(tnx_count > 0) |>
  nrow()

obs_valid_customer_count <- obs_validdata_tbl |>
  filter(tnx_count > 0) |>
  nrow()

obs_fit_total_count <- obs_fitdata_tbl |>
  pull(tnx_count) |>
  sum()

obs_valid_total_count <- obs_validdata_tbl |>
  pull(tnx_count) |>
  sum()

obs_stats_tbl <- tribble(
  ~assess_type, ~name,               ~obs_value,
  "fit",        "multicust_count",   obs_fit_customer_count,
  "fit",        "simtnx_count",      obs_fit_total_count,
  "valid",      "multicust_count",   obs_valid_customer_count,
  "valid",      "simtnx_count",      obs_valid_total_count
  )

model_assess_tbl <- dir_ls("data", regexp = "pnbd_onlineretail_fixed.*_assess_.*simstats") |>
  enframe(name = NULL, value = "file_path") |>
  filter(str_detect(file_path, "_assess_model_", negate = TRUE)) |>
  mutate(
    model_label = str_replace(file_path, "data/pnbd_onlineretail_(.*?)_assess_.*", "\\1"),
    assess_type = if_else(str_detect(file_path, "_assess_fit_"), "fit", "valid"),
    
    sim_data = map(
      file_path, calculate_simulation_statistics,
      
      .progress = "calculate_simulation_statistics"
      )
    )

model_assess_tbl |> glimpse()
Rows: 8
Columns: 4
$ file_path   <fs::path> "data/pnbd_onlineretail_fixed1_assess_fit_simstats_tb…
$ model_label <chr> "fixed1", "fixed1", "fixed2", "fixed2", "fixed3", "fixed3"…
$ assess_type <chr> "fit", "valid", "fit", "valid", "fit", "valid", "fit", "va…
$ sim_data    <list> [<tbl_df[2000 x 3]>], [<tbl_df[2000 x 3]>], [<tbl_df[2000…

We have now constructed the simulation summary statistics and now reshape our data to aid in our model assessment.

Code
model_assess_summstat_tbl <- model_assess_tbl |>
  select(model_label, assess_type, sim_data) |>
  unnest(sim_data) |>
  pivot_longer(
    cols = !c(model_label, assess_type, draw_id)
    ) |>
  group_by(model_label, assess_type, name) |>
  summarise(
    .groups = "drop",
    
    mean_val = mean(value),
    p10 = quantile(value, 0.10),
    p25 = quantile(value, 0.25),
    p50 = quantile(value, 0.50),
    p75 = quantile(value, 0.75),
    p90 = quantile(value, 0.90)
    )

model_assess_summstat_tbl |> glimpse()
Rows: 16
Columns: 9
$ model_label <chr> "fixed1", "fixed1", "fixed1", "fixed1", "fixed2", "fixed2"…
$ assess_type <chr> "fit", "fit", "valid", "valid", "fit", "fit", "valid", "va…
$ name        <chr> "multicust_count", "simtnx_count", "multicust_count", "sim…
$ mean_val    <dbl> 1219.1055, 5958.6265, 901.8570, 9348.8505, 1291.7110, 4885…
$ p10         <dbl> 1194.0, 5723.0, 878.9, 8817.9, 1265.0, 4673.0, 662.9, 3765…
$ p25         <dbl> 1206.00, 5842.00, 890.00, 9083.00, 1279.00, 4773.00, 672.0…
$ p50         <dbl> 1219.0, 5953.0, 902.0, 9348.0, 1292.0, 4890.0, 683.0, 4048…
$ p75         <dbl> 1232.00, 6079.00, 914.00, 9624.00, 1306.00, 4992.00, 695.0…
$ p90         <dbl> 1244.0, 6196.0, 924.0, 9863.1, 1317.0, 5098.0, 705.0, 4352…

We now use this data to construct model comparison plots for the different models we have fit.

Code
#! echo: TRUE

ggplot(model_assess_summstat_tbl) +
  geom_errorbar(
    aes(x = model_label, ymin = p10, ymax = p90), width = 0
    ) +
  geom_errorbar(
    aes(x = model_label, ymin = p25, ymax = p75), width = 0, linewidth = 3
    ) +
  geom_hline(
    aes(yintercept = obs_value),
    data = obs_stats_tbl, colour = "red"
    ) +
  scale_y_continuous(labels = label_comma()) +
  expand_limits(y = 0) +
  facet_wrap(
    vars(assess_type, name), scale = "free_y"
    ) +
  labs(
    x = "Model",
    y = "Count",
    title = "Comparison Plot for the Different Models"
    ) +
  theme(
    axis.text.x = element_text(angle = 20, vjust = 0.5, size = 8)
    )

6.1 Write Assessment Data to Disk

We now want to save the assessment data to disk.

Code
model_assess_tbl |> write_rds("data/assess_data_pnbd_onlineretail_fixed_tbl.rds")

7 R Environment

Code
options(width = 120L)
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       Ubuntu 22.04.3 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Dublin
 date     2023-10-24
 pandoc   3.1.1 @ /usr/local/bin/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
 package        * version    date (UTC) lib source
 abind            1.4-5      2016-07-21 [1] RSPM (R 4.3.1)
 arrayhelpers     1.1-0      2020-02-04 [1] RSPM (R 4.3.1)
 backports        1.4.1      2021-12-13 [1] RSPM (R 4.3.0)
 base64enc        0.1-3      2015-07-28 [1] RSPM (R 4.3.0)
 bayesplot      * 1.10.0     2022-11-16 [1] RSPM (R 4.3.1)
 bit              4.0.5      2022-11-15 [1] RSPM (R 4.3.0)
 bit64            4.0.5      2020-08-30 [1] RSPM (R 4.3.0)
 bridgesampling   1.1-2      2021-04-16 [1] RSPM (R 4.3.1)
 brms           * 2.20.4     2023-09-25 [1] RSPM (R 4.3.1)
 Brobdingnag      1.2-9      2022-10-19 [1] RSPM (R 4.3.1)
 cachem           1.0.8      2023-05-01 [1] RSPM (R 4.3.0)
 callr            3.7.3      2022-11-02 [1] RSPM (R 4.3.0)
 checkmate        2.2.0      2023-04-27 [1] RSPM (R 4.3.1)
 cli              3.6.1      2023-03-23 [1] RSPM (R 4.3.0)
 cmdstanr       * 0.6.0.9000 2023-10-17 [1] Github (stan-dev/cmdstanr@a13c798)
 coda             0.19-4     2020-09-30 [1] RSPM (R 4.3.1)
 codetools        0.2-19     2023-02-01 [2] CRAN (R 4.3.1)
 colorspace       2.1-0      2023-01-23 [1] RSPM (R 4.3.0)
 colourpicker     1.3.0      2023-08-21 [1] RSPM (R 4.3.1)
 conflicted     * 1.2.0      2023-02-01 [1] RSPM (R 4.3.1)
 cowplot        * 1.1.1      2020-12-30 [1] RSPM (R 4.3.1)
 crayon           1.5.2      2022-09-29 [1] RSPM (R 4.3.0)
 crosstalk        1.2.0      2021-11-04 [1] RSPM (R 4.3.1)
 data.table       1.14.8     2023-02-17 [1] RSPM (R 4.3.0)
 digest           0.6.33     2023-07-07 [1] RSPM (R 4.3.0)
 directlabels   * 2023.8.25  2023-09-01 [1] RSPM (R 4.3.1)
 distributional   0.3.2      2023-03-22 [1] RSPM (R 4.3.1)
 dplyr          * 1.1.3      2023-09-03 [1] RSPM (R 4.3.0)
 DT               0.30       2023-10-05 [1] RSPM (R 4.3.1)
 dygraphs         1.1.1.6    2018-07-11 [1] RSPM (R 4.3.1)
 ellipsis         0.3.2      2021-04-29 [1] RSPM (R 4.3.0)
 evaluate         0.21       2023-05-05 [1] RSPM (R 4.3.0)
 fansi            1.0.4      2023-01-22 [1] RSPM (R 4.3.0)
 farver           2.1.1      2022-07-06 [1] RSPM (R 4.3.0)
 fastmap          1.1.1      2023-02-24 [1] RSPM (R 4.3.0)
 forcats        * 1.0.0      2023-01-29 [1] RSPM (R 4.3.0)
 fs             * 1.6.3      2023-07-20 [1] RSPM (R 4.3.1)
 furrr          * 0.3.1      2022-08-15 [1] RSPM (R 4.3.1)
 future         * 1.33.0     2023-07-01 [1] RSPM (R 4.3.1)
 generics         0.1.3      2022-07-05 [1] RSPM (R 4.3.0)
 ggdist           3.3.0      2023-05-13 [1] RSPM (R 4.3.1)
 ggplot2        * 3.4.3      2023-08-14 [1] RSPM (R 4.3.0)
 globals          0.16.2     2022-11-21 [1] RSPM (R 4.3.1)
 glue           * 1.6.2      2022-02-24 [1] RSPM (R 4.3.0)
 gridExtra        2.3        2017-09-09 [1] RSPM (R 4.3.1)
 gtable           0.3.4      2023-08-21 [1] RSPM (R 4.3.0)
 gtools           3.9.4      2022-11-27 [1] RSPM (R 4.3.1)
 hms              1.1.3      2023-03-21 [1] RSPM (R 4.3.0)
 htmltools        0.5.6      2023-08-10 [1] RSPM (R 4.3.0)
 htmlwidgets      1.6.2      2023-03-17 [1] RSPM (R 4.3.0)
 httpuv           1.6.11     2023-05-11 [1] RSPM (R 4.3.0)
 igraph           1.5.1      2023-08-10 [1] RSPM (R 4.3.1)
 inline           0.3.19     2021-05-31 [1] RSPM (R 4.3.1)
 jsonlite         1.8.7      2023-06-29 [1] RSPM (R 4.3.0)
 knitr            1.44       2023-09-11 [1] RSPM (R 4.3.0)
 labeling         0.4.3      2023-08-29 [1] RSPM (R 4.3.0)
 later            1.3.1      2023-05-02 [1] RSPM (R 4.3.0)
 lattice          0.21-8     2023-04-05 [2] CRAN (R 4.3.1)
 lifecycle        1.0.3      2022-10-07 [1] RSPM (R 4.3.0)
 listenv          0.9.0      2022-12-16 [1] RSPM (R 4.3.1)
 lobstr         * 1.1.2      2022-06-22 [1] RSPM (R 4.3.1)
 loo              2.6.0      2023-03-31 [1] RSPM (R 4.3.1)
 lubridate      * 1.9.2      2023-02-10 [1] RSPM (R 4.3.0)
 magrittr       * 2.0.3      2022-03-30 [1] RSPM (R 4.3.0)
 markdown         1.9        2023-09-30 [1] RSPM (R 4.3.1)
 Matrix           1.5-4.1    2023-05-18 [2] CRAN (R 4.3.1)
 matrixStats      1.0.0      2023-06-02 [1] RSPM (R 4.3.1)
 memoise          2.0.1      2021-11-26 [1] RSPM (R 4.3.0)
 mime             0.12       2021-09-28 [1] RSPM (R 4.3.0)
 miniUI           0.1.1.1    2018-05-18 [1] RSPM (R 4.3.0)
 munsell          0.5.0      2018-06-12 [1] RSPM (R 4.3.0)
 mvtnorm          1.2-3      2023-08-25 [1] RSPM (R 4.3.1)
 nlme             3.1-162    2023-01-31 [2] CRAN (R 4.3.1)
 parallelly       1.36.0     2023-05-26 [1] RSPM (R 4.3.1)
 pillar           1.9.0      2023-03-22 [1] RSPM (R 4.3.0)
 pkgbuild         1.4.2      2023-06-26 [1] RSPM (R 4.3.0)
 pkgconfig        2.0.3      2019-09-22 [1] RSPM (R 4.3.0)
 plyr             1.8.9      2023-10-02 [1] RSPM (R 4.3.1)
 posterior      * 1.4.1      2023-03-14 [1] RSPM (R 4.3.1)
 prettyunits      1.2.0      2023-09-24 [1] RSPM (R 4.3.0)
 processx         3.8.2      2023-06-30 [1] RSPM (R 4.3.0)
 promises         1.2.1      2023-08-10 [1] RSPM (R 4.3.0)
 ps               1.7.5      2023-04-18 [1] RSPM (R 4.3.0)
 purrr          * 1.0.2      2023-08-10 [1] RSPM (R 4.3.0)
 quadprog         1.5-8      2019-11-20 [1] RSPM (R 4.3.1)
 QuickJSR         1.0.6      2023-09-12 [1] RSPM (R 4.3.1)
 R6               2.5.1      2021-08-19 [1] RSPM (R 4.3.0)
 Rcpp           * 1.0.11     2023-07-06 [1] RSPM (R 4.3.0)
 RcppParallel     5.1.7      2023-02-27 [1] RSPM (R 4.3.1)
 readr          * 2.1.4      2023-02-10 [1] RSPM (R 4.3.0)
 reshape2         1.4.4      2020-04-09 [1] RSPM (R 4.3.1)
 rlang          * 1.1.1      2023-04-28 [1] RSPM (R 4.3.0)
 rmarkdown        2.25       2023-09-18 [1] RSPM (R 4.3.0)
 rstan            2.26.23    2023-09-08 [1] RSPM (R 4.3.1)
 rstantools       2.3.1.1    2023-07-18 [1] RSPM (R 4.3.1)
 rstudioapi       0.15.0     2023-07-07 [1] RSPM (R 4.3.0)
 rsyslog        * 1.0.3      2023-05-08 [1] RSPM (R 4.3.1)
 scales         * 1.2.1      2022-08-20 [1] RSPM (R 4.3.0)
 sessioninfo      1.2.2      2021-12-06 [1] RSPM (R 4.3.1)
 shiny            1.7.5      2023-08-12 [1] RSPM (R 4.3.0)
 shinyjs          2.1.0      2021-12-23 [1] RSPM (R 4.3.1)
 shinystan        2.6.0      2022-03-03 [1] RSPM (R 4.3.1)
 shinythemes      1.2.0      2021-01-25 [1] RSPM (R 4.3.1)
 StanHeaders      2.26.28    2023-09-07 [1] RSPM (R 4.3.1)
 stringi          1.7.12     2023-01-11 [1] RSPM (R 4.3.0)
 stringr        * 1.5.0      2022-12-02 [1] RSPM (R 4.3.0)
 svUnit           1.0.6      2021-04-19 [1] RSPM (R 4.3.1)
 tensorA          0.36.2     2020-11-19 [1] RSPM (R 4.3.1)
 threejs          0.3.3      2020-01-21 [1] RSPM (R 4.3.1)
 tibble         * 3.2.1      2023-03-20 [1] RSPM (R 4.3.0)
 tidybayes      * 3.0.6      2023-08-12 [1] RSPM (R 4.3.1)
 tidyr          * 1.3.0      2023-01-24 [1] RSPM (R 4.3.0)
 tidyselect       1.2.0      2022-10-10 [1] RSPM (R 4.3.0)
 tidyverse      * 2.0.0      2023-02-22 [1] RSPM (R 4.3.0)
 timechange       0.2.0      2023-01-11 [1] RSPM (R 4.3.0)
 tzdb             0.4.0      2023-05-12 [1] RSPM (R 4.3.0)
 utf8             1.2.3      2023-01-31 [1] RSPM (R 4.3.0)
 vctrs            0.6.3      2023-06-14 [1] RSPM (R 4.3.0)
 vroom            1.6.3      2023-04-28 [1] RSPM (R 4.3.0)
 withr            2.5.0      2022-03-03 [1] RSPM (R 4.3.0)
 xfun             0.40       2023-08-09 [1] RSPM (R 4.3.0)
 xtable           1.8-4      2019-04-21 [1] RSPM (R 4.3.0)
 xts              0.13.1     2023-04-16 [1] RSPM (R 4.3.1)
 yaml             2.3.7      2023-01-23 [1] RSPM (R 4.3.0)
 zoo              1.8-12     2023-04-13 [1] RSPM (R 4.3.1)

 [1] /usr/local/lib/R/site-library
 [2] /usr/local/lib/R/library

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Code
options(width = 80L)